Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding of non-ascii contents written to parameter files. #18972

Closed
wants to merge 5 commits into from

Conversation

zhengwei143
Copy link
Contributor

@zhengwei143 zhengwei143 commented Jul 18, 2023

When args are written to parameter files, non-ascii values are wrongly encoded again as utf-8. This seems to be unaffected by the JDK20 upgrade of Bazel, and has always been happening.

Repro:

$ cat encoding/BUILD$ cat encoding/BUILD
load("defs.bzl", "cat")

cat(
    name = "test_cat",
    out = "cat.txt",
    content = "привет",
)

sh_binary(
    name = "cat_bin",
    srcs = ["test_cat.sh"],
)

$ cat encoding/defs.bzl
def _test_cat_impl(ctx):
  args = ctx.actions.args()
  args.use_param_file("%s", use_always = True)
  args.add(ctx.attr.content)
  ctx.actions.run(
    inputs = [],
    outputs = [ctx.outputs.out],
    arguments = [args, ctx.outputs.out.path],
    executable = ctx.executable.cat_bin,
  )

cat = rule(
  implementation = _test_cat_impl,
  attrs = {
    "out": attr.output(mandatory = True),
    "content": attr.string(mandatory = True),
    "cat_bin": attr.label(
      executable = True,
      cfg = "exec",
      allow_files = True,
      default = Label("//encoding:cat_bin"),
    ),
  })

$ cat encoding/test_cat.sh
#!/bin/sh
cat "$1" >> "$2";

$ bazel build //encoding:test_cat
$ cat bazel-bin/encoding/cat.txt-0.params
'п�иве�'
$ cat bazel-bin/encoding/cat.txt
'п�иве�'

@zhengwei143 zhengwei143 marked this pull request as ready for review July 19, 2023 12:36
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams labels Jul 19, 2023
@zhengwei143
Copy link
Contributor Author

Potential fix for #18792

Copy link
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems logically correct but inefficient: writeContentUtf8 is a private method which has exactly 1 call site, so we can certainly avoid double-recoding.

I would suggest reverting the change to writeContent(), and instead changing writeContentUtf8() to the following:

      ...
      if (stringUnsafe.getCoder(line) == StringUnsafe.LATIN1 && isAscii(bytes)) {
        outputStream.write(bytes);
      } else if (!StringUtil.decodeBytestringUtf8(line).equals(line)) {
        // We successfully decoded line from utf8 - meaning it was already encoded as utf8.
        // We do not want to double-encode.
        outputStream.write(bytes);
      } else {
        ByteBuffer encodedBytes = encoder.encode(CharBuffer.wrap(line));
        ...

@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Performance Issues for Performance teams
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants